---HackOn(Data) Contest Challenge, Sep 10-11, 2016
Qian(John) Xie, Mingfei Cao, and Roland Sing
HackOnData is a free two-day event that brings together the Toronto data community to take a closer look at the data that touches our daily lives. Teams of Toronto's top data scientists and data engineers collaborate to generate practical insights from data provided by local companies, not-for-profits, and the government. Prior to the event, weekly workshops and challenges help prepare participants by giving them the knowledge and hands-on experience required to participate meaningfully. During the event, well-known mentors from Toronto and around the world engage with participants to take their knowledge and skills to the next level. HackOn(Data) is the best platform available for local data talent and businesses to meet, collaborate, and exchange knowledge and experience.
Sponsors
TranQuant, flipp, wattpad, LoyaltyOne, amazon, Lightbend, GuruLink, Shopify
Partners
Toronto Apache Spark, scalator, Deep Learning Toronto, HackerNest, HackerNest Toronto Tech Socials, TechToronto, DMZ, City of Toronto, Toronto Public Library
In 2011, the City of Toronto launched the TO360 Wayfinding Project. The integrated multi-modal wayfinding strategy comprises pedestrian, vehicular, cycling, and transit wayfinding. The project aims to:
The project is implemented in three phases:
The project is currently in phase 3. In determining where wayfinding products are required, a number of factors were considered:
Existing Need - The implementation strategy prioritizes areas where a need for wayfinding currently exists based on:
Available Funding - Further, certain areas may be prioritized as project partners come forward with funding to implement the scheme. Potential project partners include:
For the HackOn(Data) event, the City of Toronto has an interest in exploring a more data-driven methodology to determine the timing and geographic distribution of the required TO360 map asset upgrades. A data-driven methodology may yield valuable insights from a different and novel perspective and help domain experts make a more effective and reliable map placement plan.
Reference: Toronto Wayfinding Strategy
We chose an exploratory approach for this problem. We are not aiming to find a "perfect" solution by considering all needs and using all the available data. Instead, our goal is to build a prototype model using a few of the most important data sources (provided by TranQuant and City of Toronto Open Data). If we can gain insights from the solution and the methodology is actionable, we can refine the prototype by taking more needs into consideration, incorporating more data sources, and using more advanced algorithms.
Among the four aspects of the multi-modal wayfinding strategy, we focus on pedestrian wayfinding. We chose to analyze two datasets from the City of Toronto, available on TranQuant:
After deciding to use the above two datasets, the study began with a single question: what criteria should we consider when choosing an intersection at which to place a map?
At the beginning, we treated the problem as an optimization problem. After some discussion, we decided this direction was possible but might be difficult to implement within a two-day time limit. We then did some exploratory data analysis by plotting the geographic distributions of the intersections and facilities on a map. After seeing the plot, we thought clustering was a direction we should try. The logic is as follows:
The logic is heuristic but simple and easy to implement, so we managed to build a solution based on it within the two-day time limit.
I tried two clustering algorithms: DBSCAN and KMeans. DBSCAN is density-based clustering, while KMeans is centroid-based clustering.
At this stage, I adopted the KMeans algorithm. KMeans works better for dividing our facilities into spatially even clusters according to their geographic coordinates. DBSCAN provides another perspective on the problem, and we can explore that angle in the future.
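Since only the DBSCAN runs are shown in detail below, here is a minimal sketch of the KMeans alternative on synthetic coordinates; the blobs, spread, and cluster count are made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# two synthetic blobs of (latitude, longitude) points around Toronto;
# the locations, spread, and cluster count here are made up for illustration
rng = np.random.RandomState(42)
coords = np.vstack([
    rng.normal([43.65, -79.38], 0.01, size=(50, 2)),  # "downtown" blob
    rng.normal([43.75, -79.45], 0.01, size=(50, 2)),  # "uptown" blob
])

# KMeans divides the points into a fixed number of spatially even clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_  # one candidate map location per cluster
```

In the full pipeline, each cluster centroid could then be matched to its nearest surveyed intersection (e.g. with scipy.spatial) as a candidate map placement.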
The following are the major data science related packages used for this project:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.cm as cm
import seaborn as sns
import folium
from folium import plugins
from sklearn.cluster import DBSCAN, KMeans
from sklearn import metrics
from scipy import spatial
import time
from IPython.display import Image
from IPython.core.display import HTML
%matplotlib inline
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
- ped_vol_2012.csv: pedestrian and vehicle volume data collected at some intersections in 2012
- ped_vol_2013.csv: pedestrian and vehicle volume data collected at some intersections in 2013
- ped_vol_2014.csv: pedestrian and vehicle volume data collected at some intersections in 2014
- ped_vol_2015.csv: pedestrian and vehicle volume data collected at some intersections in 2015
- signalizedTrafficPedestrianVolumes.csv: pedestrian and vehicle volume data collected at some intersections from 1999 to 2012

Cultural Facility Data in the City of Toronto
The same data was provided in three formats:
MSFC_44_Wards_Complete_Final.csv: a list of cultural facilities in Toronto. The file consists of the name of each facility, its address, ward information, and the ownership of facilities that are available on a rental basis for cultural events. Two accompanying documents, MSFC_Readme_1.csv and MSFC_Readme_2.csv, describe the contents of the file and its schema (columns) in detail. MSFC_44_Wards_Complete_Final.csv only provides a postal code, not coordinates, for each facility.
Make_space_for_culture_mtm3.zip: shape files in the MTM 3 coordinate system and the facility file MAKE_SPACE_FOR_CULTURE.dbf, which contains coordinates for the facilities.
Make_space_for_culture_wgs84.zip: shape files in the WGS84 coordinate system and the facility file MAKE_SPACE_FOR_CULTURE_WGS84.dbf, which contains coordinates for the facilities.
Note
The signalizedTrafficPedestrianVolumes.csv file contains 24HrPedVol and 24HrVehVol columns. After careful examination, we found that 24HrPedVol = 2 × 8HrPedVol and 24HrVehVol = 2 × 8HrVehVol, so we can drop the two columns without affecting our analysis.
# Data collected in 2012
intersection_2012_df = pd.read_csv("./Data/ped_vol_2012.csv",
names=['PX', 'main', 'midblock_route', 'side1_route', 'side2_route', 'activation_date',
'latitude', 'longitude', 'count_date', '8hr_vel_vol', '8hr_ped_vol'],
skiprows=1)
intersection_2012_df.head()
# check to make sure data types are correct
intersection_2012_df.dtypes
# check number of records
intersection_2012_df.shape
# Data collected in 2013
intersection_2013_df = pd.read_csv("./Data/ped_vol_2013.csv",
names=['PX', 'main', 'midblock_route', 'side1_route', 'side2_route', 'activation_date',
'latitude', 'longitude', 'count_date', '8hr_vel_vol', '8hr_ped_vol'],
skiprows=1)
intersection_2013_df.head()
# check to make sure data types are correct
intersection_2013_df.dtypes
# check number of records
intersection_2013_df.shape
# Data collected in 2014
intersection_2014_df = pd.read_csv("./Data/ped_vol_2014.csv",
names=['PX', 'main', 'midblock_route', 'side1_route', 'side2_route', 'activation_date',
'latitude', 'longitude', 'count_date', '8hr_vel_vol', '8hr_ped_vol'],
skiprows=1)
intersection_2014_df.head()
# check to make sure data types are correct
intersection_2014_df.dtypes
# check number of records
intersection_2014_df.shape
# Data collected in 2015
intersection_2015_df = pd.read_csv("./Data/ped_vol_2015.csv",
names=['PX', 'main', 'midblock_route', 'side1_route', 'side2_route', 'activation_date',
'latitude', 'longitude', 'count_date', '8hr_vel_vol', '8hr_ped_vol'],
skiprows=1)
intersection_2015_df.head()
# check to make sure data types are correct
intersection_2015_df.dtypes
# check number of records
intersection_2015_df.shape
# Data collected in 1999-2012, in the signalizedTrafficPedestrianVolumes.csv file
# note that in the signalizedTrafficPedestrianVolumes.csv file
# the 8hr_vel_vol, 8hr_ped_vol, 24hr_ped_vol, and 24hr_veh_vol columns
# contain commas as thousands separators, so you need to use the
# keyword parameter thousands="," to read the table correctly;
# otherwise those columns will be returned as object (string) columns
intersection_1999_2012_df1 = pd.read_csv("./Data/signalizedTrafficPedestrianVolumes.csv",
names=['PX', 'main', 'midblock_route', 'side1_route', 'side2_route',
'activation_date','latitude', 'longitude', 'count_date',
'8hr_vel_vol', '8hr_ped_vol', '24hr_ped_vol', '24hr_veh_vol'],
skiprows=1, thousands=",")
intersection_1999_2012_df1.head()
# check to make sure data types are correct
intersection_1999_2012_df1.dtypes
# check number of records
intersection_1999_2012_df1.shape
intersection_1999_2012_df1['24hr_ped_vol'].describe()
# busiest intersection by pedestrian volume 1999-2012
intersection_1999_2012_df1.sort_values(by='24hr_ped_vol', ascending=False).head()
# busiest intersection by vehicle volume 1999-2012
intersection_1999_2012_df1.sort_values(by='24hr_veh_vol', ascending=False).head()
Note
Looking carefully at the intersection_1999_2012_df1 dataframe, the 24hr_ped_vol and 24hr_veh_vol columns are just 2 × 8hr_ped_vol and 2 × 8hr_vel_vol respectively. They are redundant columns and can be dropped.
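Before dropping them, the claimed redundancy can be verified directly. A quick sketch on a made-up frame with the same column names:

```python
import pandas as pd

# toy rows mimicking the volume columns of the 1999-2012 file;
# the numbers are made up, only the 2x relationship matters
df = pd.DataFrame({
    '8hr_ped_vol':  [100, 250, 0],
    '8hr_vel_vol':  [400, 120, 60],
    '24hr_ped_vol': [200, 500, 0],
    '24hr_veh_vol': [800, 240, 120],
})

# confirm the 24hr columns are exactly twice the 8hr columns
ped_redundant = (df['24hr_ped_vol'] == 2 * df['8hr_ped_vol']).all()
veh_redundant = (df['24hr_veh_vol'] == 2 * df['8hr_vel_vol']).all()

# once confirmed, the redundant columns can be dropped safely
slim = df.drop(['24hr_ped_vol', '24hr_veh_vol'], axis=1)
```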
intersection_1999_2012_df=intersection_1999_2012_df1.drop(['24hr_ped_vol', '24hr_veh_vol'], axis = 1)
intersection_1999_2012_df.head()
intersection_1999_2012_df.dtypes
# check number of records
intersection_1999_2012_df.shape
# check the total number of intersection records
total_intersection_records = (intersection_1999_2012_df.shape[0]+ intersection_2012_df.shape[0]
+ intersection_2013_df.shape[0]+ intersection_2014_df.shape[0]
+ intersection_2015_df.shape[0])
print(total_intersection_records)
# combine all intersection data
frames =[intersection_1999_2012_df, intersection_2012_df,
intersection_2013_df, intersection_2014_df,
intersection_2015_df]
intersection_all_df = pd.concat(frames)
intersection_all_df.head()
# double check to the combined the dataframe has the same number of records
print(total_intersection_records == intersection_all_df.shape[0])
# PX is a unique ID for an intersection, if we sort the combined dataframe by 'PX'
# We can see that some of the intersections have multiple records of pedestrian
# and vehicle volume data.
intersection_all_df.sort_values('PX')
We only want unique intersections in the final dataframe, so we need to average the multiple records for a single intersection.
To build the final intersection_ped_df data frame, we create two dataframes and join them on PX:
1. PX, main, midblock_route, side1_route, side2_route, latitude, and longitude, with duplicates removed based on PX, so we have unique intersections.
2. PX with the average 8hr_vel_vol and average 8hr_ped_vol.
# first dataframe
unique_intersection_df = intersection_all_df[['PX', 'main', 'midblock_route', 'side1_route',
'side2_route', 'latitude', 'longitude']].drop_duplicates('PX')
unique_intersection_df.count() #2256 unique intersections
# second dataframe
avg_ped_vel_df = intersection_all_df[['PX', '8hr_vel_vol', '8hr_ped_vol']].groupby('PX').mean()
avg_ped_vel_df.count()
avg_ped_vel_df.head()
# Join the two dataframes
# This dataframe will be the final dataframe for analysis use
intersection_ped_df = unique_intersection_df.join(avg_ped_vel_df, on='PX', how = 'inner')
intersection_ped_df.head()
intersection_ped_df.dtypes
intersection_ped_df.shape
In total, we have 2256 unique major intersections with counted pedestrian and vehicle volumes. We will select intersections from these to place maps.
ADD_NUM = ADDRESS_NUMBER (Street number)
LF_NAME = LINEAR_NAME_FULL (Street Name)
ADDRESS = ADDRESS_FULL (Full address)
POSTAL_CD = POSTAL_CODE (POSTAL CODE)
CITY = CITY
X = X (Easting in MTM NAD27 3 degree Projection)
Y = Y (Northing in MTM NAD27 3 degree Projection)
LONGITUDE = LONGITUDE (LONGITUDE = Longitude in WGS84 Coordinate System)
LATITUDE = LATITUDE (Latitude in WGS84 Coordinate System)
FAC_NAM = FACILITY_NAME (FACILITY NAME)
STE_FLR_UN = SUITE_FLOOR_UNIT (SUITE FLOOR UNIT)
PERFRMANCE = PERFORMANCE (Spaces in which performing arts (dance, music, theatre, etc.) creation or presentation takes place)
EXHBVISARTT = EXHIBITION_VISUAL_ARTS (Spaces in which visual arts creation or presentation can take place, in addition to pure exhibition space.)
SCRN_BASED = SCREEN_BASED (Spaces for the production and presentation of multimedia screen-based arts including digital, )
LIBRARY = LIBRARY (Toronto Public Library facility with physical space for cultural activity)
MULTIPURP = MULTIPURPOSE (Spaces that are not purpose-built and can house a range of cultural activity across disciplines.)
HERITAGE = HERITAGE (Facilities where heritage activity takes place (historical societies, archives, community museums, etc.))
OWNERSHIP = OWNERSHIP (OWNERSHIP)
OBJECTID = OBJECTID (Unique system identifier)
# use lower case column names for facility dataframe
col_names = ['add_num', 'lf_name', 'address', 'postal_cd', 'city', 'x', 'y', 'longitude', 'latitude', 'fac_name',
'ste_flr_un', 'performance', 'exhbvisart', 'scrn_based', 'library', 'multipurp', 'heritage', 'ownership', 'objectID']
# create facility_df dataframe
facility_df = pd.read_csv('./Data/facilities/make_space_for_culture_wgs84/MAKE_SPACE_FOR_CULTURE_WGS84.csv',
names = col_names, skiprows =1)
facility_df.head()
facility_df.dtypes
# 1397 cultural facilities
facility_df.shape
We have 1397 cultural facilities in the City of Toronto. In the next section, we will apply clustering algorithms to group the cultural facilities into clusters according to their geographic coordinates.
In this section, we will check the intersection pedestrian and vehicle volume distributions to see what patterns they follow and whether there is any interesting behavior in the data.
# descriptive statistics of 8-hour pedestrian volume
intersection_ped_df['8hr_ped_vol'].describe()
The average 8-hour pedestrian volume is about 12,163 for the surveyed intersections. The smallest pedestrian volume is 0; the largest is 53,678.
# pedestrian volume distribution
fig, ax = plt.subplots(figsize =(10, 8))
sns.distplot(intersection_ped_df['8hr_ped_vol'].values, bins= 100, kde=False)
ax.set_xlabel('Pedestrian Volume', fontsize = 20)
ax.set_ylabel('Number of Intersections', fontsize = 20)
ax.set_title('Pedestrian Volume Distribution', fontsize = 30)
Let's check which intersections have the highest pedestrian volume and which have the lowest.
# sort the dataframe by '8hr_ped_vol' from low to high
sorted_ped_df = intersection_ped_df.sort_values(by ='8hr_ped_vol')
# bottom 20 intersections with the lowest pedestrian volume
bottom20ped = sorted_ped_df[:20]
# Top 20 intersections with the highest pedestrian volume
top20ped = sorted_ped_df[-20:]
# bottom 20 intersections with the lowest pedestrian volume
bottom20ped
### Plot 20 intersections with lowest pedestrian volume on interactive Map
bottom20ped_map = folium.Map(location = [43.6532, -79.3832])
bottom20ped_map.save('bottom20ped.html')
marker_cluster_intersection = folium.MarkerCluster().add_to(bottom20ped_map)
for index, row in bottom20ped.iterrows():
folium.Marker([row["latitude"],row["longitude"]] ).add_to(marker_cluster_intersection )
bottom20ped_map.save('bottom20ped.html')
bottom20ped_map
# Top 20 intersections with the highest pedestrian volume
top20ped
### Plot 20 intersections with highest pedestrian volume on interactive Map
top20ped_map = folium.Map(location = [43.6532, -79.3832])
top20ped_map.save('top20ped.html')
marker_cluster_intersection = folium.MarkerCluster().add_to(top20ped_map)
for index, row in top20ped.iterrows():
folium.Marker([row["latitude"],row["longitude"]] ).add_to(marker_cluster_intersection )
top20ped_map.save('top20ped.html')
top20ped_map
It is surprising to see that none of the top 20 intersections with the highest pedestrian volume are located in the core downtown of the City of Toronto. I have some doubts about the accuracy of the pedestrian volume data.
# descriptive statistics of 8-hour vehicle volume
intersection_ped_df['8hr_vel_vol'].describe()
The average vehicle volume is 6,003. The minimum vehicle volume is 0; the maximum is 38,815.
# vehicle volume distribution
fig, ax = plt.subplots(figsize =(10, 8))
sns.distplot(intersection_ped_df['8hr_vel_vol'].values, bins= 100, kde=False)
ax.set_xlabel('Vehicle Volume', fontsize = 20)
ax.set_ylabel('Number of Intersections', fontsize = 20)
ax.set_title('Vehicle Volume Distribution', fontsize = 30)
sorted_veh_df = intersection_ped_df.sort_values(by='8hr_vel_vol')
# bottom 20 intersections with the lowest vehicle volume
bottom20veh = sorted_veh_df[:20]
# Top 20 intersections with the highest vehicle volume
top20veh = sorted_veh_df[-20:]
# 20 intersections with the lowest vehicle volume
bottom20veh
### Plot 20 intersections with lowest vehicle volume on interactive Map
bottom20veh_map = folium.Map(location = [43.6532, -79.3832])
bottom20veh_map.save('bottom20veh.html')
marker_cluster_intersection = folium.MarkerCluster().add_to(bottom20veh_map)
for index, row in bottom20veh.iterrows():
folium.Marker([row["latitude"],row["longitude"]]).add_to(marker_cluster_intersection )
bottom20veh_map.save('bottom20veh.html')
bottom20veh_map
# 20 intersections with the highest vehicle volume
top20veh
### Plot 20 intersections with highest vehicle volume on interactive Map
top20veh_map = folium.Map(location = [43.6532, -79.3832])
top20veh_map.save('top20veh.html')
marker_cluster_intersection = folium.MarkerCluster().add_to(top20veh_map)
for index, row in top20veh.iterrows():
folium.Marker([row["latitude"],row["longitude"]]).add_to(marker_cluster_intersection )
top20veh_map.save('top20veh.html')
top20veh_map
In this section, we will show the geographic distributions of intersections and facilities using their geographic coordinates.
# plot intersections using their geographic coordinates
fig, ax = plt.subplots(figsize=[20, 12])
intersection_scatter = ax.scatter(intersection_ped_df['longitude'], intersection_ped_df['latitude'],
c='b', edgecolor='None', alpha=0.9, s=12)
ax.set_title('City of Toronto Major Intersections', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=20)
ax.set_ylabel('Latitude', fontsize =20)
ax.legend([intersection_scatter], ['Intersections'], loc='upper right', fontsize = 20)
plt.show()
From the graph, we can see the shape of the City of Toronto and some of the main streets.
Note: for future improvement, we can add the City of Toronto basemap to the plot for better visualization.
### Plot intersections on interactive Map
intersection_map = folium.Map(location = [43.6532, -79.3832])
intersection_map.save('intersections.html')
marker_cluster_intersection = folium.MarkerCluster().add_to(intersection_map)
for index, row in intersection_ped_df.iterrows():
folium.Marker([row["latitude"],row["longitude"]] ).add_to(marker_cluster_intersection )
intersection_map.save('intersections.html')
intersection_map
The interactive map gives us more context about each intersection.
# plot Cultural facilities distribution based on
# geographic coordinates
fig, ax = plt.subplots(figsize=[20, 12])
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'],
c='#99cc99', edgecolor='None', alpha=0.9, s=120)
ax.set_title('City of Toronto Cultural Facilities', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=20)
ax.set_ylabel('Latitude', fontsize =20)
ax.legend([facility_scatter], ['Cultural Facilities'], loc='upper right', fontsize = 20)
plt.show()
It is no surprise that cultural facilities are concentrated in the downtown area of the City of Toronto.
Plot Cultural Facilities on Interactive Map
# Cultural Facilities on Interactive Map
facility_map = folium.Map(location = [43.6532, -79.3832])
facility_map.save('facilities.html')
marker_cluster_facility = folium.MarkerCluster().add_to(facility_map)
for index, row in facility_df.iterrows():
folium.Marker([row["latitude"],row["longitude"]] ).add_to(marker_cluster_facility)
facility_map.save('facilities.html')
facility_map
We first tried the DBSCAN algorithm in scikit-learn for clustering. DBSCAN is based on the paper "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" (Ester et al., 1996).
The algorithm is aimed at spatial data, which might be a good fit for our case.
The following is a description of DBSCAN algorithm from sklearn:
The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm,
min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
DBSCAN algorithm clusters dataset based on two parameters:
eps - the maximum distance between neighbouring points for them to be considered part of the same cluster
min_samples - the minimum cluster size. If it is set to 1, every data point will either be assigned to a cluster or form its own cluster of one data point. If min_samples is larger than one, then clusters with fewer than min_samples points are treated as noise.
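The interplay of the two parameters can be illustrated with a toy one-dimensional example (the values below are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# six made-up 1-D points: two tight groups and one isolated point
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])

# min_samples=1: every point belongs to a cluster;
# the isolated point simply forms its own cluster
labels_all = DBSCAN(eps=0.5, min_samples=1).fit(X).labels_

# min_samples=2: the isolated point no longer qualifies
# and is labelled -1, i.e. treated as noise
labels_noise = DBSCAN(eps=0.5, min_samples=2).fit(X).labels_
```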
The scikit-learn DBSCAN haversine distance metric requires data in the form of [latitude, longitude] and both inputs and outputs are in units of radians.
Since we need to consider all facilities in our dataset and don't want any of them classified as noise, we set min_samples=1. The clustering then depends on choosing a proper eps value.
Let's try eps = 1.5, 1.0, 0.7 (km).
Note that eps needs to be converted to radians for use by haversine.
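As a sanity check on the km-to-radian conversion, here is a sketch using scikit-learn's pairwise haversine distances; the second coordinate is an arbitrary point near downtown, chosen only for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, used to convert km <-> radians

# an eps of 1.5 km expressed in radians, as DBSCAN's haversine metric expects
eps_rad = 1.5 / EARTH_RADIUS_KM

# haversine_distances takes [latitude, longitude] in radians and
# returns great-circle distances in radians
pts = np.radians([[43.6532, -79.3832],   # downtown Toronto
                  [43.6629, -79.3957]])  # an arbitrary nearby point
dist_km = haversine_distances(pts)[0, 1] * EARTH_RADIUS_KM
```

Multiplying the radian distance back by the Earth's radius recovers kilometres, confirming the round trip; the pair above comes out at roughly 1.5 km apart.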
# define the number of kilometers in one radian
# which will be used to convert eps from km to radians
kms_per_rad = 6371.0088
# define a function to calculate the geographic coordinate
# centroid of a cluster of geographic points
# it will be used later to calculate the centroids of DBSCAN cluster
# because Scikit-learn DBSCAN cluster algorithm does not calculate centroid
def get_centroid(cluster):
"""calculate the centroid of a cluster of geographic coordinate points
Args:
cluster: cluster coordinates, nx2 array-like (array, list of lists, etc)
n is the number of coordinate points (latitude, longitude) in the cluster.
Return:
centroid: numpy array, geometry centroid of the cluster
"""
cluster_ary = np.asarray(cluster)
centroid = cluster_ary.mean(axis = 0)
return centroid
# testing get_centroid function
test_cluster= [[ 43.70487299, -79.57753802],
[ 43.71138367, -79.56524418],
[ 43.72616079, -79.57319998],
[ 43.73547907, -79.56258364],
[ 43.72070325, -79.57202018],
[ 43.73126031, -79.5598719 ]]
test_centroid = get_centroid(test_cluster)
print(test_centroid)
print(type(test_centroid))
# convert eps to radians for use by haversine
epsilon = 1.5/kms_per_rad
# Extract intersection coordinates (latitude, longitude)
fac_coords = facility_df[['latitude', 'longitude']].values
start_time = time.time()
dbsc = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine')
.fit(np.radians(fac_coords)))
fac_cluster_labels = dbsc.labels_
# get the number of clusters
num_clusters = len(set(dbsc.labels_))
# print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(facility_df), num_clusters, 100*(1 - float(num_clusters) / len(facility_df)), time.time()-start_time))
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(fac_coords, fac_cluster_labels)))
# turn the clusters into a pandas series, where each element is a cluster of points
dbsc_clusters = pd.Series([fac_coords[fac_cluster_labels==n] for n in range(num_clusters)])
# get centroid of each cluster
fac_centroids = dbsc_clusters.map(get_centroid)
# unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
cent_lats, cent_lons = zip(*fac_centroids)
# from these lats/lons create a new df of one representative point for each cluster
centroids_df = pd.DataFrame({'longitude':cent_lons, 'latitude':cent_lats})
# Plot the facility clusters and cluster centroid
fig, ax = plt.subplots(figsize=[20, 12])
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'],
c=fac_cluster_labels, cmap = cm.Dark2,
edgecolor='None', alpha=0.7, s=120)
centroid_scatter = ax.scatter(centroids_df['longitude'], centroids_df['latitude'],
marker='x', linewidths=2, c='k', s=50)
ax.set_title('DBSCAN eps = 1.5, Facility Clusters & Facility Cluster Centroids', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([facility_scatter, centroid_scatter],
['Facilities', 'Facility Cluster Centroid'],
loc='upper right', fontsize = 20)
plt.show()
As we can see, when we set eps = 1.5 the DBSCAN algorithm groups the cultural facilities into 20 clusters based on their geographic coordinates. The figure further illustrates the point that:
"DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped."
Obviously, 20 clusters are not enough. We need many more than 20 intersections at which to place maps. Let's adjust the eps parameter and see what result we can get.
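Rather than trying eps values one by one, the relationship between eps and the number of clusters can be swept. A sketch on synthetic coordinates (the dense core and sparse outskirts below are simulated, not the real facility data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

kms_per_rad = 6371.0088
rng = np.random.RandomState(0)

# simulated (latitude, longitude) points: a dense core plus sparse outskirts
coords = np.vstack([
    rng.normal([43.65, -79.38], 0.005, size=(80, 2)),             # dense core
    rng.uniform([43.58, -79.60], [43.80, -79.15], size=(40, 2)),  # sparse rest
])

# cluster count for each candidate eps (km): larger eps merges clusters
cluster_counts = {}
for eps_km in [0.4, 0.7, 1.0, 1.5]:
    dbsc = DBSCAN(eps=eps_km / kms_per_rad, min_samples=1,
                  algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
    cluster_counts[eps_km] = len(set(dbsc.labels_))
```

With min_samples=1 the cluster count can only fall (or stay flat) as eps grows, which makes such a sweep an easy way to shortlist eps values before plotting.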
# convert eps to radians for use by haversine
epsilon = 0.7/kms_per_rad
# Extract intersection coordinates (latitude, longitude)
fac_coords = facility_df[['latitude', 'longitude']].values
start_time = time.time()
dbsc = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine')
.fit(np.radians(fac_coords)))
fac_cluster_labels = dbsc.labels_
# get the number of clusters
num_clusters = len(set(dbsc.labels_))
# print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(facility_df), num_clusters, 100*(1 - float(num_clusters) / len(facility_df)), time.time()-start_time))
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(fac_coords, fac_cluster_labels)))
# turn the clusters into a pandas series, where each element is a cluster of points
dbsc_clusters = pd.Series([fac_coords[fac_cluster_labels==n] for n in range(num_clusters)])
# get centroid of each cluster
fac_centroids = dbsc_clusters.map(get_centroid)
# unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
cent_lats, cent_lons = zip(*fac_centroids)
# from these lats/lons create a new df of one representative point for each cluster
centroids_df = pd.DataFrame({'longitude':cent_lons, 'latitude':cent_lats})
# Plot the facility clusters and cluster centroid
fig, ax = plt.subplots(figsize=[20, 12])
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'],
c=fac_cluster_labels, cmap = cm.Dark2, edgecolor='None', alpha=0.7, s=120)
centroid_scatter = ax.scatter(centroids_df['longitude'], centroids_df['latitude'],
marker='x', linewidths=2, c='k', s=50)
ax.set_title('DBSCAN eps = 0.7, Facility Clusters & Facility Cluster Centroids', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([facility_scatter, centroid_scatter], ['Facilities', 'Facility Cluster Centroid'],
loc='upper right', fontsize = 20)
plt.show()
When we set eps = 0.7, the DBSCAN algorithm groups the facilities into 185 clusters. From the plot we can clearly see that the core of downtown is one huge cluster, accounting for more than half of all facilities, while most other clusters contain only one facility. This is not a reasonable clustering on which to base map placement. Let's try eps = 1.0 km and see if we get a more reasonable result.
# convert eps to radians for use by haversine
epsilon = 1.0/kms_per_rad
# Extract intersection coordinates (latitude, longitude)
fac_coords = facility_df[['latitude', 'longitude']].values
start_time = time.time()
dbsc = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine')
.fit(np.radians(fac_coords)))
fac_cluster_labels = dbsc.labels_
# get the number of clusters
num_clusters = len(set(dbsc.labels_))
# print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(facility_df), num_clusters, 100*(1 - float(num_clusters) / len(facility_df)), time.time()-start_time))
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(fac_coords, fac_cluster_labels)))
# turn the clusters into a pandas series, where each element is a cluster of points
dbsc_clusters = pd.Series([fac_coords[fac_cluster_labels==n] for n in range(num_clusters)])
# get centroid of each cluster
fac_centroids = dbsc_clusters.map(get_centroid)
# unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
cent_lats, cent_lons = zip(*fac_centroids)
# from these lats/lons create a new df of one representative point for each cluster
centroids_df = pd.DataFrame({'longitude':cent_lons, 'latitude':cent_lats})
# Plot the facility clusters and cluster centroid
fig, ax = plt.subplots(figsize=[20, 12])
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'],c=fac_cluster_labels,
cmap = cm.Dark2, edgecolor='None', alpha=0.7, s=120)
centroid_scatter = ax.scatter(centroids_df['longitude'], centroids_df['latitude'],
marker='x', linewidths=2, c='k', s=50)
ax.set_title('DBSCAN eps = 1.0, Facility Clusters & Facility Cluster Centroids', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize=24)
ax.legend([facility_scatter, centroid_scatter], ['Facilities', 'Facility Cluster Centroid'],
loc='upper right', fontsize = 20)
plt.show()
With eps = 1.0, the DBSCAN algorithm divides the facilities into 93 clusters. Again, downtown and midtown facilities fall into the same huge cluster, while the other facilities are grouped rather nicely. Based on these findings, we learned that DBSCAN alone is not good enough for our needs, but it can be used to select the areas where facilities are dense and to help implement a phase-wise plan for rolling out maps throughout the City of Toronto.
The citywide roll-out of the map placement will be implemented through a 3-year plan as shown in the following map.

DBSCAN algorithm can help to select the facilities for each phase of implementation.
# cluster labels based on eps = 1.0
fac_cluster_labels
# add cluster label column to all facilities
labeled_facility_df = facility_df.copy()  # copy so facility_df itself is not modified
labeled_facility_df['label'] = fac_cluster_labels
labeled_facility_df.head()
labeled_facility_df.groupby('label').count()['add_num']
The downtown cluster is very condensed: it includes 1,064 of the total 1,397 facilities. Let's extract the downtown facilities based on the previous DBSCAN clustering (eps = 1.0 km).
downtown_fac_df = labeled_facility_df[labeled_facility_df['label']==0]
downtown_fac_df.head()
# Plot downtown facilities
fig, ax = plt.subplots(figsize=[20, 12])
downtown_fac_scatter = ax.scatter(downtown_fac_df['longitude'], downtown_fac_df['latitude'],
c='#99cc99', edgecolor='None', alpha=0.7, s=120)
ax.set_title('Downtown Facilities', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize=24)
ax.legend([downtown_fac_scatter], ['Downtown Facilities'],
loc='upper right', fontsize = 20)
plt.show()
Set eps = 0.4 km to further cluster the downtown facilities.
# convert eps to radians for use by haversine
epsilon = 0.4/kms_per_rad
# Extract intersection coordinates (latitude, longitude)
dt_fac_coords = downtown_fac_df[['latitude', 'longitude']].values
start_time = time.time()
dbsc_dt = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine')
.fit(np.radians(dt_fac_coords)))
dt_labels = dbsc_dt.labels_
# get the number of clusters
num_clusters = len(set(dbsc_dt.labels_))
# print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(downtown_fac_df), num_clusters, 100*(1 - float(num_clusters) / len(downtown_fac_df)), time.time()-start_time))
print('Silhouette coefficient: {:0.03f}'.format(metrics.silhouette_score(dt_fac_coords, dt_labels)))
# turn the clusters into a pandas series, where each element is a cluster of points
dt_clusters = pd.Series([dt_fac_coords[dt_labels==n] for n in range(num_clusters)])
# get centroid of each cluster
dt_fac_centroids = dt_clusters.map(get_centroid)
# unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
cent_lats, cent_lons = zip(*dt_fac_centroids)
# from these lats/lons create a new df of one representative point for each cluster
dt_centroids_df = pd.DataFrame({'longitude':cent_lons, 'latitude':cent_lats})
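The `get_centroid` helper used above is defined earlier in the notebook; a plausible minimal version, assuming it simply returns the arithmetic-mean (latitude, longitude) of a cluster, is:

```python
import numpy as np

def get_centroid(cluster):
    """Return the (lat, lon) centroid of a cluster of coordinate pairs.

    An arithmetic mean is a reasonable approximation at city scale; the
    notebook's actual helper may differ (e.g. it could instead return the
    cluster point nearest the geometric center).
    """
    cluster = np.asarray(cluster)
    return (cluster[:, 0].mean(), cluster[:, 1].mean())

# usage: centroid of three nearby points
print(get_centroid([(43.65, -79.38), (43.66, -79.39), (43.67, -79.40)]))
```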
# Plot the downtown facility clusters and cluster centroid
fig, ax = plt.subplots(figsize=[20, 12])
dt_fac_scatter = ax.scatter(downtown_fac_df['longitude'], downtown_fac_df['latitude'],c=dt_labels,
cmap = cm.Dark2, edgecolor='None', alpha=0.7, s=120)
dt_centroid_scatter = ax.scatter(dt_centroids_df['longitude'], dt_centroids_df['latitude'],
marker='x', linewidths=2, c='k', s=50)
ax.set_title('DBSCAN eps = 0.4, Downtown Facilities & Cluster Centroids', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize=24)
ax.legend([dt_fac_scatter, dt_centroid_scatter], ['Downtown Facilities', 'Cluster Centroid'],
loc='upper right', fontsize = 20)
plt.show()
# add cluster label column to downtown facilities
labeled_downtown_fac_df = downtown_fac_df.drop(['label'], axis=1)
labeled_downtown_fac_df['label'] = dt_labels
labeled_downtown_fac_df
# check which label corresponds to the core downtown cluster
labeled_downtown_fac_df.groupby('label').count()['add_num']
The core downtown cluster's label is 7, and it contains 655 facilities. Now let's extract the core downtown facilities.
# facilities in the core of downtown
downtown_core_fac_df = labeled_downtown_fac_df[labeled_downtown_fac_df['label']==7]
downtown_core_fac_df
There are 655 facilities in the core of downtown.
# Plot core downtown facilities
fig, ax = plt.subplots(figsize=[20, 12])
downtown_core_fac_scatter = ax.scatter(downtown_core_fac_df['longitude'], downtown_core_fac_df['latitude'],
c='#99cc99', edgecolor='None', alpha=0.7, s=120)
ax.set_title('Core Downtown Facilities', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize=24)
ax.legend([downtown_core_fac_scatter], ['Core Downtown Facilities'],
loc='upper right', fontsize = 20)
plt.show()
Now we have three groups of facilities: core downtown facilities, downtown and main-street facilities, and facilities outside the core. With these three groups we can implement the three-year phase-wise plan for the map roll-out. For each group we can then use the KMeans clustering algorithm to divide it into small clusters and place a map near the centroid of each cluster of facilities.
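The per-group clustering described above could be sketched as follows. The helper is illustrative, and the commented-out phase call uses a hypothetical DataFrame name and cluster count (not values from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def cluster_group(group_df, n_clusters):
    """Cluster one phase's facilities and return the cluster centroids."""
    coords = group_df[['latitude', 'longitude']].values
    model = KMeans(n_clusters=n_clusters, random_state=0).fit(coords)
    return pd.DataFrame(model.cluster_centers_, columns=['latitude', 'longitude'])

# demo on random points standing in for one group of facilities
rng = np.random.RandomState(1)
demo_df = pd.DataFrame(rng.uniform([43.6, -79.6], [43.8, -79.1], size=(40, 2)),
                       columns=['latitude', 'longitude'])
centroids = cluster_group(demo_df, 5)   # one candidate map site per cluster
print(centroids.shape)

# in the notebook this would be run once per phase, e.g. (hypothetical):
# phase1_sites = cluster_group(downtown_core_fac_df, 150)
```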
In this section, we use the KMeans clustering algorithm to group cultural facilities according to their geographic coordinates.
Note
Here we apply KMeans clustering to the entire facility dataset to get a general idea of how the KMeans algorithm works for our problem.
# use scikit-learn KMeans algorithm
n_clusters = 200
fac_coords = facility_df.as_matrix(columns=['latitude', 'longitude'])
KM200_model = KMeans(n_clusters = n_clusters)
KM200_model.fit(fac_coords)
# turn the clusters into a pandas series, where each element is a cluster of points
KM200_clusters = pd.Series([fac_coords[KM200_model.labels_==n] for n in range(n_clusters)])
# centroids of 200 clusters
fac200_centroids = KM200_model.cluster_centers_
# Labels
KM200_label = KM200_model.labels_
# Plot facility clusters and cluster centroids
fig, ax = plt.subplots(figsize=[20, 12])
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'], c=KM200_label,
cmap = cm.Dark2, edgecolor='None', alpha=0.7, s=120)
centroid_scatter = ax.scatter(fac200_centroids[:,1], fac200_centroids[:,0], marker='x', linewidths=2, c='k', s=30)
ax.set_title('KMeans 200 Facility Clusters & Facility Cluster Centroids', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([facility_scatter, centroid_scatter], ['Facilities', 'Facility Cluster Centroid'], loc='upper right', fontsize = 20)
plt.show()
From the plot above we can see that 200 clusters might not be enough, because the distance each map would need to cover is too long.
# Add cluster labels to facility_df dataframe
labeled_KM200_facility_df = facility_df.copy()
labeled_KM200_facility_df['label'] = KM200_label
labeled_KM200_facility_df.head()
labeled_KM200_facility_df.groupby('label').count()[['add_num', 'address']].describe()
The smallest cluster has 1 facility; the largest contains 46 facilities.
Let's increase the number of clusters to 300.
n_clusters = 300
fac_coords = facility_df.as_matrix(columns=['latitude', 'longitude'])
KM300_model = KMeans(n_clusters =n_clusters)
KM300_model.fit(fac_coords)
# turn the clusters into a pandas series, where each element is a cluster of points
KM300_clusters = pd.Series([fac_coords[KM300_model.labels_==n] for n in range(n_clusters)])
# centroids of the 300 clusters
fac300_centroids = KM300_model.cluster_centers_
# Labels
KM300_label = KM300_model.labels_
# Plot the facility clusters and cluster centroids
fig, ax = plt.subplots(figsize=[20, 12])
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'], c=KM300_label,
cmap = cm.Dark2, edgecolor='None', alpha=0.7, s=120)
centroid_scatter = ax.scatter(fac300_centroids[:,1], fac300_centroids[:,0],
marker='x', linewidths=2, c='k', s=30)
ax.set_title('KMeans 300 Facility Clusters & Facility Cluster Centroids', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([facility_scatter, centroid_scatter],
['Facilities', 'Facility Cluster Centroid'],
loc='upper right', fontsize = 20)
plt.show()
# Plot facilities and facility cluster centroids on interactive map
facility_centroid_map = folium.Map(location = [43.6532, -79.3832])
# Add markers of facilities
facility_cluster = folium.MarkerCluster().add_to(facility_centroid_map)
for index, row in facility_df.iterrows():
    folium.Marker([row['latitude'], row['longitude']]).add_to(facility_cluster)
# Add markers of the KMeans cluster centroids
KM300_centroids_df = pd.DataFrame(fac300_centroids, columns=['latitude', 'longitude'])
centroid_cluster = folium.MarkerCluster().add_to(facility_centroid_map)
for index, row in KM300_centroids_df.iterrows():
    folium.RegularPolygonMarker([row['latitude'], row['longitude']],
                                fill_color='#769d96', number_of_sides=8, radius=6,
                                popup='cluster centroid').add_to(centroid_cluster)
facility_centroid_map.save('facility_centroids.html')
facility_centroid_map
# Add cluster labels to facility_df dataframe
labeled_KM300_facility_df = facility_df.copy()
labeled_KM300_facility_df['label'] = KM300_label
labeled_KM300_facility_df.head()
labeled_KM300_facility_df.groupby('label').count()[['add_num', 'address']].describe()
The smallest cluster has 1 facility; the largest has 43 facilities.
We will use 300 clusters as the prototype for later selection of intersections for map placement.
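The choice of 300 clusters was made here by visual inspection of the two plots; an inertia ("elbow") sweep is one way to make this choice more systematic. A sketch on synthetic coordinates (in the notebook, `fac_coords` would take the place of the random points):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for fac_coords (the real data has 1,397 points)
rng = np.random.RandomState(0)
coords = rng.uniform([43.6, -79.6], [43.8, -79.1], size=(500, 2))

# inertia (within-cluster sum of squared distances) always drops as k grows;
# a sensible k is where the improvement starts to flatten out
for k in [50, 100, 200, 300]:
    inertia = KMeans(n_clusters=k, random_state=0).fit(coords).inertia_
    print('k = {:3d}  inertia = {:.4f}'.format(k, inertia))
```

Plotting inertia against k and picking the bend of the curve would give a data-driven counterpart to the 200-versus-300 comparison above.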
Now that we have 300 facility clusters and their centroids, our goal is to find an appropriate intersection at which to place a map for each facility cluster, as shown in the following plot.
We compared two cases:
Case 1: Select the intersection that is closest to the centroid of a cluster
Case 2: Select the three intersections closest to the centroid of a cluster, then among those three select the one with the highest pedestrian volume.
KDTree Algorithm
To find the closest intersection to a cluster centroid, we will use the KDTree algorithm.
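A minimal illustration of how `scipy.spatial.KDTree` is used: build the tree once from the candidate points, then query it for each target. Note that this applies Euclidean distance directly to latitude/longitude pairs, which is only an approximation at city scale; the coordinates below are illustrative:

```python
import numpy as np
from scipy import spatial

# candidate intersections as (lat, lon) -- illustrative values only
intersections = np.array([
    [43.650, -79.380],
    [43.655, -79.390],
    [43.700, -79.400],
])
tree = spatial.KDTree(intersections)   # build once, query many times

# nearest intersection to a single centroid
distance, index = tree.query([43.651, -79.381])
print(index)   # -> 0: the first intersection is closest

# k=3 returns the three nearest candidates at once, ordered by distance
distances, indices = tree.query([43.651, -79.381], k=3)
print(indices)
```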
# plot facility cluster centroids and intersections
fig, ax = plt.subplots(figsize=[20, 12])
intersection_scatter = ax.scatter(intersection_ped_df['longitude'],
intersection_ped_df['latitude'],
c='b', edgecolor='None', alpha=0.9, s=12)
KM300_centroid_scatter = ax.scatter(fac300_centroids[:,1], fac300_centroids[:,0],
linewidths=2, c='k', s=40)
ax.set_title('300 Facility Clusters Centroids & 2256 Intersections', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=20)
ax.set_ylabel('Latitude', fontsize =20)
ax.legend([intersection_scatter, KM300_centroid_scatter],
['Intersections', 'Facilities Cluster Centroids'], loc='upper right', fontsize = 20)
plt.show()
The small blue dots represent intersections; the big black dots represent facility cluster centroids. Our goal now is to find the closest intersection for each cluster centroid.
intersection_coords = intersection_ped_df.as_matrix(columns=['latitude', 'longitude'])
# store the index of the closest intersection to each cluster centroid
closest_intersection_index = []
# find the closest points using the scipy KDTree algorithm
# (build the tree once, then query it for each centroid)
intersection_tree = spatial.KDTree(intersection_coords)
for i in range(len(fac300_centroids)):
    distance, index = intersection_tree.query(fac300_centroids[i])
    closest_intersection_index.append(index)
# have a look at the index
closest_intersection_index
len(closest_intersection_index)
# one intersection may be the closest to multiple cluster centroids,
# especially in downtown where facilities are condensed.
len(set(closest_intersection_index))
map_intersection_index = list(set(closest_intersection_index))
len(map_intersection_index)
# selected intersections for map placement
map_intersection_df = intersection_ped_df.iloc[map_intersection_index]
map_intersection_df
# Check the pedestrian volume of the selected 296 intersections
map_intersection_df['8hr_ped_vol'].describe()
The average 8-hour pedestrian volume is 11,246.8 for the selected 296 intersections.
# Plot the selected 296 Intersections for Map Placement
fig, ax = plt.subplots(figsize=[20, 12])
intersection_scatter = ax.scatter(intersection_ped_df['longitude'],
intersection_ped_df['latitude'],
c='b', edgecolor='None', alpha=0.9, s=8)
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'], c=KM300_label,
cmap = cm.Dark2, edgecolor='None', alpha=0.7, s=120)
map_intersection_scatter = ax.scatter(map_intersection_df['longitude'],
map_intersection_df['latitude'],
marker='x', linewidths=2, c='k', s=30)
ax.set_title('300 Facility Clusters & 296 Closest Intersections for Map Placement', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([intersection_scatter, facility_scatter, map_intersection_scatter],
['Intersections', 'Facilities', 'Selected Intersections for Map Placement'],
loc='lower right', fontsize = 20)
plt.show()
# store the indices of the 3 closest intersections to each cluster centroid
closest_intersection_index2 = []
# Find the 3 closest intersections using KDTree (build the tree once)
intersection_tree = spatial.KDTree(intersection_coords)
for i in range(len(fac300_centroids)):
    distance, index = intersection_tree.query(fac300_centroids[i], 3)
    closest_intersection_index2.append(index)
# The returned index is a list of arrays
closest_intersection_index2
len(closest_intersection_index2)
# create an empty dataframe with the same schema as intersection_ped_df
map_intersection_df2 = intersection_ped_df[intersection_ped_df['PX']<0]
map_intersection_df2
print(map_intersection_df2.columns)
print(map_intersection_df2.dtypes)
print(map_intersection_df2.empty)
for ary in closest_intersection_index2:
    # the 3 closest intersections to a facility centroid
    temp_df = intersection_ped_df.iloc[ary]
    # select the intersection with the maximum pedestrian volume
    # among the 3 closest intersections
    max_ped_df = temp_df.loc[temp_df['8hr_ped_vol'].idxmax()]
    # append the selected intersection row to map_intersection_df2
    map_intersection_df2 = map_intersection_df2.append(max_ped_df)
# 300 intersections, duplicates may exist
map_intersection_df2
# drop duplicates, and 289 unique intersections were selected
# for map placement for the 300 facility clusters
map_intersection_df2 = map_intersection_df2.drop_duplicates()
map_intersection_df2
# Check the pedestrian volume of the selected intersections
map_intersection_df2['8hr_ped_vol'].describe()
If we consider both distance and pedestrian volume, the average 8-hour pedestrian volume is 16,004.4 for the 289 selected intersections. In other words, the intersections selected by this method have on average 4,758 (42%) more 8-hour pedestrian volume than those selected by considering distance alone. This is a huge difference. Therefore, we will select intersections using the second method, which considers both distance and pedestrian volume.
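The Case 2 loop can also be written without row-by-row appends, by querying all centroids in a single KDTree call and selecting the highest-volume candidate per centroid with NumPy. A sketch with synthetic stand-in data (the column names follow the notebook; `demo_ped_df` and `demo_centroids` are made-up substitutes for `intersection_ped_df` and `fac300_centroids`):

```python
import numpy as np
import pandas as pd
from scipy import spatial

# synthetic stand-ins for the intersection table and cluster centroids
rng = np.random.RandomState(0)
demo_ped_df = pd.DataFrame({
    'latitude': rng.uniform(43.6, 43.8, 50),
    'longitude': rng.uniform(-79.6, -79.1, 50),
    '8hr_ped_vol': rng.randint(1000, 30000, 50),
})
demo_centroids = rng.uniform([43.6, -79.6], [43.8, -79.1], size=(10, 2))

# build the tree once and query all centroids in a single call
tree = spatial.KDTree(demo_ped_df[['latitude', 'longitude']].values)
_, idx = tree.query(demo_centroids, k=3)            # idx has shape (10, 3)

# for each centroid, keep the candidate with the highest pedestrian volume
vols = demo_ped_df['8hr_ped_vol'].values[idx]       # shape (10, 3)
best = idx[np.arange(len(idx)), vols.argmax(axis=1)]
selected = demo_ped_df.iloc[np.unique(best)]        # duplicates removed
print(len(selected))
```

Besides avoiding repeated `append` calls (which copy the DataFrame each time), this keeps the whole selection vectorized.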
# Plot the selected 289 intersections for map placement
fig, ax = plt.subplots(figsize=[20, 12])
intersection_scatter = ax.scatter(intersection_ped_df['longitude'],
intersection_ped_df['latitude'],
c='b', edgecolor='None', alpha=0.9, s=8)
facility_scatter = ax.scatter(facility_df['longitude'], facility_df['latitude'], c=KM300_label,
cmap = cm.Dark2, edgecolor='None', alpha=0.7, s=120)
map_intersection_scatter = ax.scatter(map_intersection_df2['longitude'],
map_intersection_df2['latitude'],
marker='x', linewidths=2, c='k', s=30)
ax.set_title('300 Facility Clusters & 289 Intersections for Map Placement', fontsize = 30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([intersection_scatter, facility_scatter, map_intersection_scatter],
['Intersections', 'Facilities', 'Selected Intersections for Map Placement'],
loc='lower right', fontsize = 20)
plt.show()
# Plot the selected intersections on an interactive map
selected_intersection_map = folium.Map(location = [43.6532, -79.3832])
# Add markers of the selected intersections
intersection_cluster = folium.MarkerCluster().add_to(selected_intersection_map)
for index, row in map_intersection_df2.iterrows():
    folium.Marker([row['latitude'], row['longitude']]).add_to(intersection_cluster)
selected_intersection_map.save('selected_intersection.html')
selected_intersection_map
From this interactive map, we can see the final selection of approximately 300 intersections for map placement, given that we divided the 1,397 cultural facilities into 300 clusters.
There is a lot we can do to improve our current solution.